09. Missing Values and Outliers
Missing values are especially common in healthcare, where records may be incomplete or some fields are sparsely populated.
Missing Data Classification
MCAR stands for Missing Completely at Random. This means that the data is missing due to something unrelated to the data, and there is no systematic reason for the missing data. In other words, every case has an equal probability of having missing data. This is often caused by an instrumentation or process issue, such as a broken instrument, where some of the data is randomly lost.
MAR stands for Missing at Random. Despite the name, this is the opposite case: there is a systematic relationship between the probability that a value is missing and other observed data. For example, respondents in certain demographic groups may be more likely to leave some survey questions blank.
MNAR stands for Missing Not at Random. This usually means the probability that a value is missing is related to the missing value itself.
Understanding why data is missing helps you choose the best method to impute or drop the values in your dataset.
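The three mechanisms above can be illustrated with a small simulation. This is only a sketch on synthetic data: the column names (`sex`, `weight_kg`), the probabilities, and the 90 kg cutoff are all assumptions made up for illustration, not values from the lesson.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 10_000

# Hypothetical synthetic survey: sex is always observed, weight may go missing.
df = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "weight_kg": rng.normal(loc=75, scale=15, size=n),
})

# MCAR: every row has the same 20% chance of losing its weight value.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance depends on an observed column (sex), not on the weight itself.
mar_prob = np.where(df["sex"] == "F", 0.35, 0.05)
mar_mask = rng.random(n) < mar_prob

# MNAR: the chance depends on the value that goes missing (here, higher weights).
mnar_prob = np.where(df["weight_kg"] > 90, 0.60, 0.05)
mnar_mask = rng.random(n) < mnar_prob

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed = df["weight_kg"].mask(mask)  # rows where mask is True become NaN
    missing_rate_by_sex = observed.isnull().groupby(df["sex"]).mean()
    print(name,
          "| missing rate by sex:", missing_rate_by_sex.round(2).to_dict(),
          "| mean of observed weights:", round(observed.mean(), 1))
```

In this sketch, the MAR case shows up as different missing rates between groups, while the MNAR case pulls the mean of the observed weights below the true mean. With real data the MNAR pattern usually cannot be confirmed from the dataset alone, because the values that would reveal it are the ones that are missing.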
Code Concepts
Create a function to check the percent of missing and zero values you have.
import pandas as pd

def check_for_missing_and_null(df):
    # Percentage of null values and of zero values in each column
    null_df = pd.DataFrame({'columns': df.columns,
                            'percent_null': df.isnull().sum() * 100 / len(df),
                            'percent_zero': df.isin([0]).sum() * 100 / len(df)
                            })
    return null_df
Apply that function to the original dataframe:
check_for_missing_and_null(dataframe)
View the results and see if any values stand out. Again, you may need to handle different columns in different ways depending on their type and the reason for the missing or zero values.
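As a rough sketch of what handling columns differently might look like, the example below uses a toy dataframe. The column names, the 50% drop threshold, and the choice of median and placeholder imputation are all assumptions for illustration; the right choices depend on why the data is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; column names are illustrative only.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 47, 62, 29],
    "systolic_bp": [120, 0, 135, 0, 142, 118],   # 0 is not a plausible reading
    "num_prior_visits": [0, 2, 0, 1, 0, 3],      # 0 is a legitimate count
    "payer_code": ["A", None, "B", "A", None, "A"],
    "rare_lab": [np.nan, np.nan, np.nan, 1.2, np.nan, np.nan],
})

cleaned = df.copy()

# Treat zeros as missing only where zero cannot be a real measurement.
cleaned["systolic_bp"] = cleaned["systolic_bp"].replace(0, np.nan)

# Drop columns that are mostly empty (the 50% threshold is an assumption).
percent_null = cleaned.isnull().mean() * 100
cleaned = cleaned.drop(columns=percent_null[percent_null > 50].index)

# Impute numeric columns with the median and the categorical one with a placeholder.
for col in ["age", "systolic_bp"]:
    cleaned[col] = cleaned[col].fillna(cleaned[col].median())
cleaned["payer_code"] = cleaned["payer_code"].fillna("Unknown")

print(cleaned)
```

Note that `num_prior_visits` is left untouched because its zeros carry real information. Median imputation is a simple default and can bias results when the data is not missing completely at random.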
Additional Resources
Code
If you need the code, see https://github.com/udacity.
Missing and Zero Values
SOLUTION:
Finding the percentage of missing and zero values can help inform whether to impute or drop values or fields.
Missing Data Classification
QUIZ QUESTION:
Match the correct term to a description.
ANSWER CHOICES:
Description | Term |
---|---|
Women could be less likely to give their weight on a survey. | |
White cell value data is missing because a testing machine was improperly calibrated. | |
Those with low education are not accounted for in a study. | |
SOLUTION:
Description | Term |
---|---|
Women could be less likely to give their weight on a survey. | MAR |
Those with low education are not accounted for in a study. | MNAR |
White cell value data is missing because a testing machine was improperly calibrated. | MCAR |